require(kableExtra)
require(gt)
require(gtsummary)
require(tidyverse)
require(adegenet)
require(knitr)
require(magrittr)
This is document is an R notebook. If you’d like view to pre-rendered figures, read a summary of analysis and interact with code, please open the relevant html file in a browser.
To conduct a similar analyses on your computer, edit or run code: clone this repository into a directory on your local machine and open the .Rproj file in Rstudio.
The project repository is archived at github
Data Sources
Raw data included 1186 individuals from 11 river basins genotyped using the SFGL Ots353 GTseq Panel. After filtering, 977 individuals genotyped at 324 markers remained.
The full genotyping and filtering log is available in the project repository, here
Import Data
#import filtered genotype data
load("../project_genotyping/filtered_genotype_data/genind_2.0.R")
load("../project_genotyping/filtered_genotype_data/genos_2.2.R")
meta_data <- readxl::read_xlsx("../metadata/consolidated_metadata/combined_run_info.xlsx", sheet = 5)
Let’s filter the dataset to only include Umpqua spring Chinook salmon.
First, we’ll summarise what was sampled (not filtered)
kable(meta_data %>%
mutate(carcass = case_when(str_detect(Sample, "OtsCC") ~ "carcass",
TRUE ~ "live")) %>%
mutate(location = str_to_lower(location)) %>%
filter(Population %in% c("NUMP", "SUMP", "UMPR"), run == "Spring") %>%
count(Population, stream, carcass, location), caption = "Umpqua River sample sizes in the before genotype quality filtering") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Population | stream | carcass | location | n |
|---|---|---|---|---|
| NUMP | N. Umpqua River | carcass | boulder to copeland | 7 |
| NUMP | N. Umpqua River | carcass | full flow reach | 7 |
| NUMP | N. Umpqua River | carcass | lower fish creek | 11 |
| NUMP | N. Umpqua River | carcass | slide cr bypass reach | 2 |
| NUMP | N. Umpqua River | carcass | soda springs to boulder cr | 22 |
| NUMP | N. Umpqua River | carcass | upstream of soda springs | 1 |
| NUMP | N. Umpqua River | live | winchester dam or rock cr hatchery | 53 |
| SUMP | S. Umpqua River | carcass | 2823 bridge | 1 |
| SUMP | S. Umpqua River | carcass | bedrock gorge | 2 |
| SUMP | S. Umpqua River | carcass | na | 3 |
| SUMP | S. Umpqua River | carcass | narrows | 1 |
| SUMP | S. Umpqua River | carcass | skillet cr cg | 3 |
| SUMP | S. Umpqua River | live | south umpqua falls | 59 |
Now let’s summarize the data after filtering
kable(genos_2.2 %>%
mutate(carcass = case_when(str_detect(sample, "OtsCC") ~ "carcass",
TRUE ~ "live")) %>%
mutate(location = str_to_lower(location)) %>%
filter(Population %in% c("NUMP", "SUMP"), run == "Spring") %>%
count(Population, stream, carcass, location), caption = "Sample sizes after filtering") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Population | stream | carcass | location | n |
|---|---|---|---|---|
| NUMP | N. Umpqua River | carcass | boulder to copeland | 2 |
| NUMP | N. Umpqua River | carcass | full flow reach | 1 |
| NUMP | N. Umpqua River | carcass | lower fish creek | 2 |
| NUMP | N. Umpqua River | carcass | soda springs to boulder cr | 2 |
| NUMP | N. Umpqua River | carcass | upstream of soda springs | 1 |
| NUMP | N. Umpqua River | live | winchester dam or rock cr hatchery | 49 |
| SUMP | S. Umpqua River | carcass | bedrock gorge | 2 |
| SUMP | S. Umpqua River | carcass | fs-2838 bridge | 1 |
| SUMP | S. Umpqua River | carcass | narrows | 1 |
| SUMP | S. Umpqua River | carcass | skillet cr cg | 2 |
| SUMP | S. Umpqua River | carcass | NA | 2 |
| SUMP | S. Umpqua River | live | south umpqua falls | 58 |
Also, let’s provide a simple, top level summary.
kable(genos_2.2 %>%
mutate(carcass = case_when(str_detect(sample, "OtsCC") ~ "carcass",
TRUE ~ "live")) %>%
mutate(location = str_to_lower(location)) %>%
filter(Population %in% c("NUMP", "SUMP"), run == "Spring") %>%
count(stream), caption = "Sample sizes after filtering") %>%
kable_classic(full_width = F, html_font = "Cambria")
| stream | n |
|---|---|
| N. Umpqua River | 57 |
| S. Umpqua River | 66 |
#let's get just the samples we need
umpr_genos <- genos_2.2 %>%
filter(Population %in% c("NUMP", "SUMP"), run == "Spring")
umpr_inds <- umpr_genos %>%
pull(sample)
genind_umpr <- genind_2.0[umpr_inds]
As a first step let’s conduct a PCA.
# first scale an center dataet and impute missing data to meanallele frequency
X <- scaleGen(genind_umpr, NA.method="mean")
#then run pca, keep all PCs
pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 123)
snp_pcs <- pca1$li#[,c(1:kg)]
#now plot data
snp_pcs %<>%
rownames_to_column(var="sample") %>%
left_join(umpr_genos)
ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = Population), alpha = 0.5) +theme_classic()
ggplot(data = snp_pcs)+geom_point(aes(Axis2, Axis3, color = Population), alpha = 0.5) +theme_classic()
ggplot(data = snp_pcs)+geom_point(aes(Axis4, Axis5, color = Population), alpha = 0.5) +theme_classic()
# let's get rid of
plotly::plot_ly(x=snp_pcs$Axis1, y=snp_pcs$Axis2, z=snp_pcs$Axis3, type="scatter3d", mode="markers", color=snp_pcs$Population, alpha = 0.8)
eigs <- as.data.frame(pca1$eig)
eigs %<>%
rowid_to_column()
ggplot(data = eigs)+geom_bar(aes(rowid, `pca1$eig`), stat = "identity", position = "dodge")+theme_classic()+xlab("PCA axis")+ylab("eigenvalue")+ggtitle("Screeplot")
It looks like we have a few outlier individuals driving most of the variation along the primary axes. This can occur when there is little structure to the data and the PCA starts grabbing noise, however because this is a dataset with substantial LD, rare alleles in gene regions overrepresented in our dataset (e.g. Greb1L-Rock1 region) might produce a similar pattern. Let’s look more closely at the outliers to see if this is the case.
Since we know from a previous analysis which Greb1L-Rock1 region markers are in strong LD and which alleles are associated with spring or fall migration timing, we can quickly check if this is the case
Below we count the number of spring (TT), fall (AA), or heterozygous (TA) genotypes at marker Ots28_110073668 at both outlier and non-outlier samples
outlier_inds <- snp_pcs %>%
filter(Axis1 > 5 | Axis2 > 10) %>%
pull(sample)
umpr_genos %<>%
mutate(outlier = case_when(sample %in% outlier_inds ~ "outlier",
TRUE ~ "not_outlier"))
umpr_genos %>%
count(Ots28_11073668, outlier )
There are three individuals that have heterozygous genotypes at the Greb1L-Rock1 region and all three are outliers, but this can’t fully explain the outlier pattern because there are 7 total outliers.
Let’s go further. What markers are driving the first two axes.
kable(pca1$c1 %>%
select(CS1) %>%
slice_max(CS1, n = 40), caption = "top 20 markers loading on PCA axis 1") %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
scroll_box(width = "500px", height = "200px")
| CS1 | |
|---|---|
| Ots28_11071377.C | 0.1489704 |
| Ots28_11072994.T | 0.1489630 |
| Ots28_11073102.A | 0.1489630 |
| Ots28_11073668.A | 0.1489630 |
| Ots28_11075348.A | 0.1489630 |
| Ots28_11075712.T | 0.1489630 |
| Ots28_11077016.T | 0.1489630 |
| Ots28_11077576.G | 0.1489630 |
| Ots37124-12277401.A | 0.1489630 |
| Ots37124-12281207.A | 0.1489630 |
| Ots28_11077172.A | 0.1489176 |
| Ots28_11164637.C | 0.1394907 |
| Ots37124-12310649.T | 0.1394907 |
| Ots28_11160599.G | 0.1394459 |
| Ots28_11095755.A | 0.1394012 |
| Ots28_11206740.T | 0.1253808 |
| Ots28_11201129.T | 0.1253602 |
| Ots28_11205423.A | 0.1253602 |
| Ots28_11070757.G | 0.1122982 |
| Ots37124-12272852.C | 0.0909242 |
| Ots37124-12270118.A | 0.0875610 |
| Ots37124-12267397.C | 0.0847719 |
| Ots28_11202190.T | 0.0835105 |
| Ots28_11062192.G | 0.0813845 |
| Ots28_11186543.A | 0.0774366 |
| Ots_tpx2-125.T | 0.0697439 |
| Ots_U2567-104.A | 0.0589868 |
| Ots_RAG3.T | 0.0481807 |
| Ots28_11210919.C | 0.0447974 |
| Ots_118205-61.T | 0.0444900 |
| Ots_117370-471.G | 0.0424047 |
| Ots_P450-288.G | 0.0385130 |
| Ots_HSP90B-100.C | 0.0368698 |
| Ots_BMP2-SNP1.T | 0.0357583 |
| Ots_110495-380.C | 0.0355402 |
| Ots_U5121-34.G | 0.0351289 |
| Ots_unk1104-38.T | 0.0349621 |
| Ots28_11202400.C | 0.0340045 |
| Ots28_11205993.C | 0.0339713 |
| Ots28_11207428.T | 0.0332233 |
kable(pca1$c1 %>%
select(CS2) %>%
slice_max(CS2, n = 40), caption = "top 20 markers loading on PCA axis 2") %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
scroll_box(width = "500px", height = "200px")
| CS2 | |
|---|---|
| Ots_CHI06048618_5222.G | 0.1431457 |
| Ots11_11925999.G | 0.1322466 |
| Ots4_64978818.C | 0.1303041 |
| Ots_u07-25_325.T | 0.1233726 |
| Ots_u211-85.C | 0.1181128 |
| Ots4_40942276.G | 0.1123654 |
| Ots7_51409415.T | 0.1119642 |
| Ots4_42378741.C | 0.1108644 |
| Ots2_42405643.T | 0.1066928 |
| Ots_nramp-321.G | 0.1052772 |
| Ots9_28975221.A | 0.1037127 |
| Ots7_50997124.G | 0.1020991 |
| Ots_FARSLA-220.G | 0.0907195 |
| Ots_GPH-318.T | 0.0903188 |
| Ots3_57055518.T | 0.0899086 |
| Ots17_1345774_C6.T | 0.0888702 |
| Ots17_1488679_C6.G | 0.0888702 |
| Ots17_885364.C | 0.0876775 |
| Ots29_23344676.C | 0.0868057 |
| Ots_113457-40R.C | 0.0854005 |
| Ots18_32088284.T | 0.0840288 |
| Ots_u07-17_135.A | 0.0834243 |
| Ots_ppie-245.A | 0.0828972 |
| Ots_cox1-241.T | 0.0810016 |
| Ots_unk9480-51.G | 0.0770229 |
| Ots19_46172133.C | 0.0736705 |
| Ots_ZR-575.A | 0.0735555 |
| Ots_P450.T | 0.0683253 |
| Ots6_10904949.C | 0.0676942 |
| Ots_cgo24-22.T | 0.0672572 |
| Ots_IL11.T | 0.0671326 |
| Ots_hnRNPL-533.T | 0.0670826 |
| Ots_FGF6B_1.A | 0.0623263 |
| Ots_crRAD255-59.C | 0.0620136 |
| Ots11_32418659.T | 0.0604396 |
| Ots_RAG3.T | 0.0603979 |
| Ots_101554-407.G | 0.0601383 |
| Ots_HSP90B-100.C | 0.0598335 |
| Ots_104415-88.T | 0.0598167 |
| Ots_Est1363.T | 0.0569970 |
Okay, we can see that axis 1 outliers are driven by the Greb1L-Rock1 region markers, but axis 2 outliers are driven by a unrelated set of markers.
In any case, let’s remove the outliers and conduct a PCA again.
The table below collects the metadata for the 7 outlier individuals.
kable(genos_2.2 %>%
filter(sample %in% outlier_inds) %>%
mutate(carcass = case_when(str_detect(sample, "OtsCC") ~ "carcass",
TRUE ~ "live")) %>%
mutate(location = str_to_lower(location)) %>%
select(sample, carcass, date, Population, location, run, sex, meps, fl, origin) %>%
arrange(Population, location), caption = "Metadata for Outlier Individuals") %>%
kable_classic(full_width = T, html_font = "Cambria") %>%
scroll_box(width = "900px", height = "200px")
| sample | carcass | date | Population | location | run | sex | meps | fl | origin |
|---|---|---|---|---|---|---|---|---|---|
| OtsCC20NUMP_0020 | carcass | 10/6/20 | NUMP | lower fish creek | Spring | F | 745 | NA | NOR |
| OtsAC20NUMP_0017 | live | NA | NUMP | winchester dam or rock cr hatchery | Spring | M | NA | 690 | NOR |
| OtsAC20NUMP_0023 | live | NA | NUMP | winchester dam or rock cr hatchery | Spring | M | NA | 835 | NOR |
| OtsAC20NUMP_0032 | live | NA | NUMP | winchester dam or rock cr hatchery | Spring | M | NA | 840 | NOR |
| OtsCC20SUMP_0002 | carcass | 10/19/20 | SUMP | bedrock gorge | Spring | F | 590 | NA | NOR |
| OtsAC21SUMP_0026 | live | 5/19/21 | SUMP | south umpqua falls | Spring | M | NA | 615 | NOR |
| OtsAC21SUMP_0059 | live | 5/28/21 | SUMP | south umpqua falls | Spring | M | NA | 555 | HOR |
# first scale an center dataet and impute missing data to meanallele frequency
non_outliers <- umpr_genos$sample[!(umpr_genos$sample %in% outlier_inds)]
X2 <- scaleGen(genind_umpr[non_outliers], NA.method="mean")
#then run pca, keep all PCs
pca2 <- dudi.pca(X2, scale = FALSE, scannf = FALSE, nf = 123)
snp_pcs <- pca2$li#[,c(1:kg)]
#now plot data
snp_pcs %<>%
rownames_to_column(var="sample") %>%
left_join(umpr_genos)
ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = Population), alpha = 0.5) +theme_classic()
ggplot(data = snp_pcs)+geom_point(aes(Axis2, Axis3, color = Population), alpha = 0.5) +theme_classic()
ggplot(data = snp_pcs)+geom_point(aes(Axis4, Axis5, color = Population), alpha = 0.5) +theme_classic()
# let's get rid of
plotly::plot_ly(x=snp_pcs$Axis1, y=snp_pcs$Axis2, z=snp_pcs$Axis3, type="scatter3d", mode="markers", color=snp_pcs$Population, alpha = 0.8)
eigs <- as.data.frame(pca2$eig)
eigs %<>%
rowid_to_column()
ggplot(data = eigs)+geom_bar(aes(rowid, `pca2$eig`), stat = "identity", position = "dodge")+theme_classic()+xlab("PCA axis")+ylab("eigenvalue")+ggtitle("Screeplot")
No, I don’t suspect there is any substantial structure in the data from PCA results. There is substantial overlap in PC space between our samples of North and South Umpqua spring Chinook salmon.
Let’s estimate the level of differentiation using FST. We will estimate some basic F statistics using hierfstat and also the pairwise Fst using the Weir and Cockerham. (Note 1 that we left the outlier individuals in the dataset for these estimates) (Note 2 that this is a simple calculation of Fstatistics, dataset wide that is unlikely to be reflective of genome-wide neutral varaition becuase of LD and the bias in our panel towards ESTs and adaptive genetic variation)
First some basic F stats
require(hierfstat)
## Loading required package: hierfstat
## Warning: package 'hierfstat' was built under R version 3.6.2
##
## Attaching package: 'hierfstat'
## The following object is masked from 'package:adegenet':
##
## read.fstat
fstat <- genind2hierfstat(genind_umpr)
colnames(fstat) <- c(pop, names(genind_umpr$loc.n.all))
#calculate datset wide basic stats
basicstats <- basic.stats(fstat)
kable(basicstats$overall, caption = "Full Dataset Fstats")
| x | |
|---|---|
| Ho | 0.2833 |
| Hs | 0.2905 |
| Ht | 0.2917 |
| Dst | 0.0012 |
| Htp | 0.2929 |
| Dstp | 0.0024 |
| Fst | 0.0040 |
| Fstp | 0.0081 |
| Fis | 0.0249 |
| Dest | 0.0033 |
Now pairwise FST using Weir and Cockerham
genet.dist(fstat, method="WC84")
## NUMP
## SUMP 0.008064787
As suspected from the PCA results, differentiation between our sample North and South Umpqua River spring run Chinook salmon is low at 0.008.
Overall FST was low (0.008),and PCA did not reveal any clear structure to the data. This suggests there are no strong genetic differences between our samples of North and South Umpqua River spring run Chinook salmon.
Importantly, our GT-seq panel only provides a small snapshot of the genetic variation within and among these groups and other potentially ecologically relevant genetic differences between these groups may still exist.